Columbia Kermit

Received: from apakabar.cc.columbia.edu by watsun.cc.columbia.edu with SMTP id AA29413
  (5.65c+CU/IDA-1.4.4/HLK for <kermit.misc@watsun.cc.columbia.edu>); Mon, 4 Sep 1995 19:15:59 -0400
Received: by apakabar.cc.columbia.edu id AA26350
  (5.65c+CU/IDA-1.4.4/HLK for kermit.misc@watsun); Mon, 4 Sep 1995 19:15:57 -0400
Path: news.columbia.edu!sol.ctr.columbia.edu!howland.reston.ans.net!agate!dog.ee.lbl.gov!news.cs.utah.edu!cc.usu.edu!jrd
From: jrd@cc.usu.edu (Joe Doupnik)
Newsgroups: comp.protocols.kermit.misc
Subject: Re: MS-KERMIT 3.14 hanging on idle TCP/IP connection?
Message-Id: <1995Sep4.170019.60531@cc.usu.edu>
Date: 4 Sep 95 17:00:19 MDT
References: <42d2u9$edt@apakabar.cc.columbia.edu> <2979@sun3.IPSWITCH.COM>
Organization: Utah State University
Lines: 97
Apparently-To: kermit.misc@watsun.cc.columbia.edu

Supplementing Dan's decent advice...

In article <2979@sun3.IPSWITCH.COM>, ddl@harvard.edu (Dan Lanciani) writes:
> In article <42dodl$go@apakabar.cc.columbia.edu>, chaiklin@konichiwa.cc.columbia.edu (Seth Chaiklin) writes:
> |
> | Joe Doupnik <jrd@cc.usu.edu> wrote:
> | > Did you have a chance to look at the ARP cache on the Linux machine?
> | >I've heard rumors (I don't use Linux) that it times out and can yield just
> | >the effects noted. You might try pinging MSK from the Linux end as one way
> | >of correcting its ARP cache.
> |
> | You are definitely on the right track (and thanks for the fast response!).
> |
> | I tried an experiment. I let the MSK machine sit idle while
> | connected to the Linux machine, and after 10 minutes (while true;
> | do date; arp -a; sleep 60; done), I discovered that the Linux arp
> | cache loses the HW address of the ethernet card, at which point,
> | of course, the MSK machine appears to be frozen.
>
> Note that most implementations intentionally time out ARP entries; this
> is a feature. I doubt that the entry is lost as such, though timeouts
> are usually a bit longer.
> You may be looking at an ARP bug in Linux
> or kermit involving bad behavior when one side already knows the address.
> These kinds of bugs come up more often than you might imagine since
> the ARP process for mainly-client programs is usually one way and the
> reverse process may be only lightly tested. Keep in mind that the answerer
> of an ARP request also retains the address of the caller to avoid sending
> an ARP itself. Starting with both machines ignorant of the hardware
> addresses, the process might go like this:
>
> kermit -> ARP-REQUEST -> Linux (saves kermit's hardware address)
> Linux -> ARP-RESPONSE -> kermit
>
> Since this is the most common sequence, kermit probably doesn't have to
> answer ARP requests at all most of the time.

True, but MSK does answer ARPs all the time the TCP stack is active
(while there is a session going). MSK does regular ARP caching too, but it
does not time out the entries for the currently used remote hosts. The bug
seems to be in Linux.

> | I tried pinging the MSK machine from the Linux machine, but it
> | does not respond. However, if I hand-entered the HW address for
> | the MSK machine, then deleted this entry from the arp cache, and
> | then added it again, I could reestablish input/output being shown
> | on the MSK machine, and everything seems to work as it should.
>
> You'd need a network trace to be sure, but this suggests that kermit
> isn't responding to ARPs in its current state. (It could also be that
> Linux isn't sending them at all, but that would be such a devastating
> error that it would have been noticed long ago. I hope.) I think there
> are at least two additional experiments that might shed light on the situation.
> First, while in the bad state, try to ping it from another machine that
> has never been involved with the connection at all. This should tell
> you whether kermit is willing to respond to anybody's ARP at this point.
> If it doesn't respond then it has somehow been corrupted (or doesn't
> respond to ARPs in general). If it does respond then it may be that kermit
> has a problem answering ARPs when it already knows the peer's hardware
> address. If it does not respond, move on to the next test:
>
> Start kermit fresh and don't connect to anything (I assume you can do this
> and still have the tcp running?). Now try to ping kermit from a machine
> which has no ARP entry for kermit. If this works and the first test failed

This won't work as intended because MSK does not run its TCP/IP
stack until a session has started (or is starting). After all, why run a
TCP/IP stack if it's not being used?

Probing MSK and the Linux box from a third uninvolved machine is
a good thing to do, however. Ping and traceroute both work with MSK.

Something else to keep in mind is another station coming on the
air with MSK's IP number. That clobbers ARP caches. MSK 3.14 checks for
another station using its IP address when the TCP/IP stack is started.
(It ARPs for its own IP number and declares any response an imposter. I
had to insert a special case to work around echoing by NDIS drivers.)

> then it is likely the program is becoming corrupted somehow. If the second
> test fails then kermit doesn't respond to ARPs at all (seems unlikely) or
> you have some obscure problem with broadcasts and/or frame types that is
> blocking ARPs in one direction. (Don't laugh; I've seen it.)

I'm not laughing either. I've encountered worse.

> The general idea is that things can work remarkably well with ARPs functioning
> in only one direction and it takes something like a short cache timeout to
> bring the problem to light. Consider that one end could be totally incapable
> of receiving broadcasts (bad NIC, bad driver, etc.) and it would still appear
> to function normally as long as it always ARP'ed first and the peer had a long
> timeout.
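
The one-way ARP scenario in the quoted paragraph can be made concrete with a small simulation. This is only a sketch under stated assumptions: the `Host` class, the `resolve` function, the addresses, and the timeout values are all hypothetical, and the "deaf" client models the hypothetical bad NIC or driver from the quote, not MSK itself (which, per the reply above, does answer ARPs while a session is active).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Host:
    """A toy station with an ARP cache (all names here are illustrative)."""
    name: str
    ip: str
    mac: str
    cache_ttl: float                 # seconds a cached entry stays valid
    hears_broadcasts: bool = True    # False models a bad NIC or driver
    cache: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def lookup(self, ip: str, now: float) -> Optional[str]:
        """Return the cached MAC for ip, dropping the entry if it expired."""
        entry = self.cache.get(ip)
        if entry is None or now >= entry[1]:
            self.cache.pop(ip, None)
            return None
        return entry[0]

    def learn(self, ip: str, mac: str, now: float) -> None:
        """Store ip -> mac with this host's own timeout."""
        self.cache[ip] = (mac, now + self.cache_ttl)


def resolve(sender: Host, target: Host, now: float) -> Optional[str]:
    """Resolve target's MAC from sender's point of view at time now."""
    mac = sender.lookup(target.ip, now)
    if mac is not None:
        return mac
    # Cache miss: the sender broadcasts an ARP request.
    if not target.hears_broadcasts:
        return None                  # request never seen, so no reply
    # The answerer saves the caller's address, then replies by unicast.
    target.learn(sender.ip, sender.mac, now)
    sender.learn(target.ip, target.mac, now)
    return target.mac


if __name__ == "__main__":
    # Assumed values: a long timeout on the client, 10 minutes on the server.
    client = Host("client", "192.0.2.2", "02:00:00:00:00:02",
                  cache_ttl=86400.0, hears_broadcasts=False)
    server = Host("server", "192.0.2.1", "02:00:00:00:00:01", cache_ttl=600.0)
    print(resolve(client, server, now=0.0))    # client ARPs first: succeeds
    print(resolve(server, client, now=300.0))  # learned from the request
    print(resolve(server, client, now=601.0))  # entry expired, client is deaf
```

As long as the client always ARPs first and the server's entry has not expired, the broken reverse direction stays hidden. Once the server's shorter timeout elapses it has to ARP for the client itself, and in this model that fails.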

There are still some rather unusual (e.g., weird) situations where
ARP responses simply don't get through. I don't know why, but they don't.
One always suspects one piece of code or another and probably one does have
a problem, but the saying "it does not happen here" applies to make diagnosis
difficult.
Joe D.

> Dan Lanciani
> ddl@harvard.*
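
The duplicate-address check mentioned in the reply (ARP for your own IP number and treat any response as an imposter) might be sketched roughly as follows. The function name, the tuple layout, and the addresses are hypothetical. In particular, recognizing an echoed frame by comparing its hardware address with our own is an assumption made for illustration; the message does not say how MSK's special case for NDIS drivers actually works.

```python
from typing import Iterable, List, Tuple

# (sender IP, sender MAC) taken from frames seen after probing our own IP.
ArpFrame = Tuple[str, str]


def find_imposters(own_ip: str, own_mac: str,
                   frames: Iterable[ArpFrame]) -> List[str]:
    """Return MACs of other stations that claim own_ip."""
    imposters: List[str] = []
    for sender_ip, sender_mac in frames:
        if sender_ip != own_ip:
            continue                 # unrelated traffic
        if sender_mac.lower() == own_mac.lower():
            continue                 # assumed: our own probe echoed back
        imposters.append(sender_mac)
    return imposters


if __name__ == "__main__":
    seen = [
        ("192.0.2.2", "02:00:00:00:00:02"),   # echo of our own probe
        ("192.0.2.2", "02:00:00:00:00:99"),   # another station using our IP
        ("192.0.2.7", "02:00:00:00:00:07"),   # unrelated host
    ]
    print(find_imposters("192.0.2.2", "02:00:00:00:00:02", seen))
```

A non-empty result would mean another station is answering for the same IP number, which is the situation the reply warns will clobber ARP caches.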